Regression Methods in the Empiric Analysis of Health Care Data

OBJECTIVES: The aim of this paper is to provide health care decision makers with a conceptual foundation for regression analysis by describing the principles of correlation, regression, and residual assessment. SUMMARY: Researchers are often faced with the need to describe quantitatively the relationships between outcomes and predictors, with the objective of explaining trends, testing hypotheses, or developing models for forecasting. Regression models are able to incorporate complex mathematical functions and operands (the variables that are manipulated) to best describe the associations between sets of variables. Unlike many other statistical techniques, regression allows for the inclusion of variables that may control for confounding phenomena or risk factors. For robust analyses to be conducted, however, the assumptions of regression must be understood and researchers must be aware of diagnostic tests and the appropriate procedures that may be used to correct for violations in model assumptions. CONCLUSIONS: Despite the complexities and intricacies that can exist in regression, this statistical technique may be applied to a wide range of studies in managed care settings. Given the increased availability of data in administrative databases, the application of these procedures to pharmacoeconomics and outcomes assessments may result in more varied and useful scientific investigations and provide a more solid foundation for health care decision making.

Researchers from a wide range of disciplines routinely use regression analyses to understand the mathematical relationships between variables, for purposes of description, hypothesis testing, and prediction.[1] Regression extends the capability provided by other statistical procedures by quantifying the level (amount) of change in the outcome or dependent variable that would be expected based upon a given level of change in a predictor or independent variable.
Applied to health care, particularly in managed care settings where administrative claims or ambulatory care databases are readily available, regression techniques may be useful in conducting pharmacoeconomic or outcomes analyses.[2-5] These methods provide the ability to assess cost functions and treatment effectiveness while simultaneously controlling for potential confounding factors such as comorbidities, prior health care utilization, and demographic characteristics. Researchers are afforded the ability to understand and describe the association of various predictor variables with phenomena such as the level of demand or consumption of medical services or changes in clinical outcomes for a given therapeutic intervention.
Regression is routinely applied to many aspects of research and reporting. As such, the objective of this paper is to establish a conceptual foundation for understanding regression techniques, to assist in their application and interpretation within the context of health care decision making. Correlation is introduced initially because it serves as a major tenet of basic regression. Following this is a presentation of regression, including the diagnostic tools needed for sound interpretation. Although broad extensions and complexities exist in a discussion of the topic, this article centers on the transparent aspects of regression analysis and then presents general considerations that may be useful for researchers wishing to incorporate these methods into practice within managed care settings.

■ ■ Correlation
Empirical investigations often seek to gain an understanding of the associations between variables. Specifically, the strength of a linear association between 2 variables is termed correlation. Although there are several measures of correlation, the most commonly reported is the Pearson product-moment correlation coefficient (typically abbreviated with a lowercase "r").[6] The Pearson correlation is an expression of the association between 2 continuous variables; other correlations may be reported according to differing characteristics of the variables (e.g., Spearman's rho, which is a correlation coefficient for ranked variables). Correlation coefficients measure 2 facets of a relationship: (1) the magnitude (with absolute coefficient values ranging from zero to one) and (2) the direction (either positive or negative). For the Pearson correlation, the larger the absolute value, the stronger the linear association: a correlation of -1.0 indicates a perfectly negative linear association, 0.0 indicates no linear association, and +1.0 indicates a perfectly positive linear association.[7] Illustrating the overall concept of correlation, Figure 1 is a graphical representation of hypothetical scatter plots of data for 2 variables.
Example. A positive correlation may exist between a patient's overall health care expenditures and the number of physician visits or hospital stays incurred (i.e., expenditures increase as a function of demand for medical services). Conversely, a negative correlation may exist between the number of errors occurring in practice and the level of quality assurance measures implemented in the workplace (i.e., errors would be expected to decrease as the number of quality assurance measures increased).
It should be emphasized that the Pearson correlation measures a linear association between 2 variables. Thus, in the presence of skewed distributions of data, curvilinear functions between variables, or instances with extreme outliers, the correlation coefficient may not be a reliable indicator of the presence of a relationship. For example, if the correlation between 2 variables is found to equal zero, this does not preclude the possibility that a relationship between the variables exists.[7] For this reason, although statistical software packages routinely report correlations and related tests of statistical significance, graphical analyses of scatter plots similar to those shown in Figure 1 should be reviewed when ascertaining the validity of any given value for a correlation.
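The caution about a zero correlation can be illustrated with a minimal sketch. The data below are hypothetical and chosen only for illustration: one outcome lies exactly on a straight line, and the other follows an exact U-shaped (quadratic) curve. The helper name `pearson_r` is an assumption of this sketch, not something defined in the paper.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation for two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squared deviations from the means
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical predictor, symmetric around zero
x = [-3, -2, -1, 0, 1, 2, 3]
y_linear = [2 * xi + 1 for xi in x]  # exact straight-line relationship
y_curved = [xi ** 2 for xi in x]     # exact U-shaped relationship

r_linear = pearson_r(x, y_linear)
r_curved = pearson_r(x, y_curved)
print("linear:", round(r_linear, 6), "curved:", round(r_curved, 6))
```

In this constructed case the curved outcome is completely determined by the predictor, yet its Pearson correlation is zero, which is why a scatter plot should accompany any reported coefficient.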
Correlation is often viewed as a precursor to regression techniques because the 2 topics are conceptually related. However, clear differences between correlation and regression exist. Whereas correlation centers upon expressing the strength of a linear association between variables, regression involves the estimation of a mathematical equation that is based upon a theoretical specification (selecting an appropriate functional form, or model, that includes all relevant variables). Thus, correlation is not considered to be dependent upon the scales of measurement between variables, whereas altering the units of measure in regression may change the interpretation of the results dramatically.[8]
In regression analysis, statistical tools based on the principle of correlation are used to assess regression results. The coefficient of determination is particularly important because it represents the proportion of variability of a dependent (or outcome) variable explained by an independent (synonymous with predictor or explanatory) variable.[9] Being the squared term of the correlation coefficient, the coefficient of determination is often referred to simply as "r-squared" ("r²"). The mathematical properties of this statistic dictate values that range from 0 to 1 and never exceed the absolute value of the correlation coefficient. In instances involving more than 2 variables, the coefficient of multiple determination, or "R-squared" ("R²"), is often reported. The adjusted R-squared coefficient may also be reported; it corrects for the number of independent variables that are specified in a regression equation when more than one predictor variable is being analyzed. The R² coefficient provides useful information to researchers in assessing the overall "goodness of fit" of a given regression equation: a higher value indicates a better fit.[1] Researchers are also encouraged to analyze the overall F statistic from regression analyses in order to assess the statistical significance of a regression equation.
Regardless of the magnitude or statistical significance of a correlation, researchers must continually be aware that only an association between variables is being measured. These statistical measures represent only one available means of assessing regression results, and their interpretation must be made within the context of other statistics and diagnostics. Thus, findings in this regard cannot be used to establish causation; experimental methodologies are required to establish a causative relationship.[6,10-13]
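The relationship among R², adjusted R², and the overall F statistic can be sketched numerically. The formulas used here are the standard textbook definitions and are not stated in this paper; the sample size, number of predictors, and correlation value are hypothetical, and the function names are assumptions made for this sketch.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R-squared: penalizes R-squared for the number of predictors."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

def overall_f_statistic(r_squared, n_obs, n_predictors):
    """Overall F statistic for the regression, expressed in terms of R-squared."""
    explained = r_squared / n_predictors
    unexplained = (1 - r_squared) / (n_obs - n_predictors - 1)
    return explained / unexplained

# Hypothetical model: multiple correlation of 0.8, 30 observations, 3 predictors
multiple_r = 0.8
r2 = multiple_r ** 2
adj_r2 = adjusted_r_squared(r2, n_obs=30, n_predictors=3)
f_stat = overall_f_statistic(r2, n_obs=30, n_predictors=3)
print(round(r2, 4), round(adj_r2, 4), round(f_stat, 4))
```

Under these assumed inputs the adjusted value falls below the unadjusted R², reflecting the penalty for the 3 predictors; the F statistic would then be compared with an F distribution on 3 and 26 degrees of freedom.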

■ ■ The Linear Regression Model
The development of a regression model begins with a literature review to identify the appropriate variables. In describing this process, Motulsky (1995) defined a model as simply a "mathematical abstraction that is an analogy of events in the real world."[14] One recommended strategy in conducting regression analysis is to perform the following steps in sequence: (1) literature review, (2) model specification (i.e., selection of appropriate dependent and independent variables and functional forms), (3) data collection, (4) model estimation and evaluation, and (5) result documentation.[1] Without a proper theoretical grounding, specification errors (e.g., using an improper functional form, excluding relevant variables, or including irrelevant variables or measurement biases) may result in biased statistical estimates.[15] Thus, Glantz and Slinker (1995) stated:
"If the [regression] model is not correctly specified, the entire process of estimating parameters and testing hypotheses about these parameters rests on a false premise and may lead to a misunderstanding of the system or process being studied. In other words, an incorrect model is worse than no model."[9]
Gujarati (1995) stated the importance of relying upon theoretical grounds for model development:
"Detecting the presence of an irrelevant variable(s) is not a difficult task. But it is very important to remember that in carrying out these tests of significance we have a specific model in mind. . . . [The] data mining technique or regression fishing, grubbing, or number crunching, is not to be recommended, for if [a given variable] legitimately belonged in the model it should have been introduced to begin with. Excluding [a given variable] in the initial regression would then lead to the omission-of-relevant-variable bias. . . . This point cannot be overemphasized: Theory must be the guide to any model building."[15]
Regression itself embodies a diverse range of analytical techniques, such as multivariate, proportional hazards, logistic, and nonlinear methods.[14] Despite the existence of more advanced methods, linear regression is the most often utilized. It describes a dependent variable as a straight-line function with respect to one or more independent variables. Simple linear regression refers to a specific case wherein a linear relationship is examined between only one dependent variable and one independent variable. An extension of simple regression is a multivariate (or multiple) regression that involves more than one explanatory variable. Adding predictor variables such as comorbidities, risk factors, and demographics thus establishes a multivariate model.
The premise behind regression is the estimation of a "line of best fit" through a set of data points, modeled from a population equation.[9] For a simple linear regression, this population model would be as follows:
$$Y = \alpha + \beta X + \varepsilon$$
where Y is the dependent variable of interest; α is the intercept, or constant, of the equation; β is the slope coefficient, often referred to simply as "beta"; X is the independent variable of interest; and ε is the residual, disturbance, or error term of the equation.[6] In simple linear regression (with 1 independent variable), the line of best fit represents a 1-dimensional line drawn in a 2-dimensional space (an XY graph). In a multivariate regression analysis, the line of best fit actually refers to a 2-dimensional plane when there are 2 independent variables and an n-dimensional hyperplane when there are n independent variables, while the β estimates represent the effects of 1 independent variable on the dependent variable while all other independent variables are held constant. When a sample of data is used to estimate the population parameters α and β, the estimates are denoted a for α and b for β; the estimated value of the dependent variable is often expressed with a caret as Ŷ and called "Y-hat."[6] Although numerous methods may be used, that of least squares (termed ordinary least squares [OLS]) is commonly used to estimate the coefficients of the regression model. Other than stating that the OLS criterion renders estimates for α and β that minimize the sum of the squared error term, the mathematical foundations of this approach are beyond the scope of this paper. As such, researchers are encouraged to consult other sources for a more comprehensive treatment of the fundamentals of coefficient estimation.[6,16,17]
A graphical depiction of a set of data points with a line of best fit, shown in Figure 2, builds upon the presentation of the simple linear model equation and illustrates several important components of regression. The intercept of the regression (i.e., the α coefficient) is defined as the point at which the line of best fit intersects the Y-axis, thus being the value of the dependent variable when the independent variable is equal to zero. The slope, or β coefficient, indicates the change in the dependent variable per change in the independent variable. The slope is fundamentally related to correlation: a positive β corresponds to a positive correlation and a negative β corresponds to a negative correlation. Beyond this, however, the β coefficient is interpreted differently. In a straightforward application of the simple linear model without transformations, β is interpreted as follows: given that the dependent variable (i.e., Y) is modeled by a linear function with the independent variable (i.e., X), Y increases by β units for every 1 unit increase in X. The interpretation of the coefficient estimates changes when different types or transformations of variables are introduced into the model, a topic that is addressed in the section "Logarithmic Transformation of Data" later in this paper.

FIGURE 2. Scatterplot of Data Points and a Line of Best Fit[1,6,9]
The straight line equation is estimated as Y = a + bX.
The intercept, a, is the point of the line's intersection with the Y-axis.
The slope, b, is interpreted as the change in the dependent variable, Y, for a given change in the independent variable, X.
The residual, e, is the difference between the actual (or observed) value of the dependent variable and the estimated (or predicted) value from the line of best fit.

Example. In a simple linear model in which the dependent and independent variables are continuous and linearly related (i.e., Y = a + bX), the value for the intercept, a, is merely the value of Y when X equals zero. The interpretation of the estimated β coefficient, b, is that if X increases by 1 unit, Y increases by b units.
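The interpretation of the intercept and slope can be made concrete with a minimal least squares sketch. The closed-form expressions used below (slope as the ratio of the sum of cross-products to the sum of squared deviations of X, intercept from the two means) are the standard OLS solution for one predictor and are not derived in this paper; the visit counts and costs are hypothetical, and `fit_simple_ols` is a name assumed for illustration.

```python
def fit_simple_ols(x, y):
    """Return (a, b), the least squares intercept and slope for Y = a + bX."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx            # slope: change in Y per 1-unit change in X
    a = mean_y - b * mean_x  # intercept: fitted value of Y when X equals zero
    return a, b

# Hypothetical data: physician visits (X) and total cost in dollars (Y)
visits = [1, 2, 3, 4, 5]
cost = [150, 240, 370, 440, 550]

a, b = fit_simple_ols(visits, cost)
predicted = [a + b * xi for xi in visits]
# Residual e = observed value minus the value predicted by the line of best fit
residuals = [yi - pi for yi, pi in zip(cost, predicted)]
print("a =", a, "b =", b, "residuals =", residuals)
```

With these invented figures, the fitted line would be read as a baseline cost of a dollars at zero visits plus b additional dollars per visit; the residuals of a least squares line with an intercept always sum to zero.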
The results of a regression analysis generally should include the coefficient of multiple determination (i.e., R² or adjusted R², as appropriate) and should consider whether the overall regression equation was statistically significant via an F statistic. It is necessary to report the following statistics for specific coefficients: the parameter estimate, the standard error, the t-statistic value (i.e., the unstandardized parameter estimate divided by the standard error), and the P value. Given that the parameter coefficient obtained from the analysis is a point estimate, lower- and upper-bound confidence intervals may also be reported so that an estimated range of values likely to include a population parameter is presented. Additionally, the practical significance versus statistical significance of the findings must also be assessed.[18,19]

■ ■ Assumptions of Regression Analysis
Researchers must remain aware of the assumptions of regression models since violations may bias the coefficient estimates and render their interpretation false. The key assumptions required for least squares analysis were reviewed by Gujarati (1995) and are consistent with those of Johnston (1984) and Greene (1997):[15-17]
1. The parameter coefficients of the model describe a linear relationship between the dependent and independent variables, described as follows: $Y_i = \alpha + \beta X_i + \varepsilon_i$
2. The predictors of the model, X, are understood to be nonstochastic (i.e., nonrandom).
3. The conditional mean of the error term (the expected value of the residual $\varepsilon_i$ for any given value of the independent variable) is equal to zero, as follows: $E(\varepsilon_i \mid X_i) = 0$
4. The residual term has a constant variance across observations; that is, it lacks heteroscedasticity (nonconstant variance). This is represented mathematically as the conditional variance, var, equal to the constant squared term of the standard deviation, σ, as follows: $\operatorname{var}(\varepsilon_i \mid X_i) = \sigma^2$
5. The residual terms of any 2 observations (i.e., $X_i$ and $X_j$ where $i \neq j$) are independent and unrelated, indicating a random distribution or lack of autocorrelation (i.e., correlation between observations). Mathematically, the lack of correlation in the residuals of 2 observations is represented as follows, wherein the conditional covariance of 2 observations with respect to the residual terms is equal to zero: $\operatorname{cov}(\varepsilon_i, \varepsilon_j \mid X_i, X_j) = 0$
6. The residual term is unrelated to all independent variables. That is, the covariance between an independent variable and the residual is equal to zero: $\operatorname{cov}(\varepsilon_i, X_i) = 0$
7. The residual term, $\varepsilon_i$, is normally distributed.
8. No perfect linear correlation exists between any of the independent variables (i.e., no multicollinearity is present).[15-17]
Gujarati (1995) also stated that, in order to derive parameter estimates, the number of observations must exceed the number of parameters estimated.[15] The number of observations per independent variable required for reasonable precision of the estimates is a function of the effect size that exists in the population (i.e., the slope of Y versus $X_i$). A general guideline offered by statisticians in the social sciences is that a minimum of 15 observations per independent variable is required. The adequacy of this value is also contingent upon the magnitude of the multiple population correlation coefficient (defined as quantifying the degree of linear relationship between one variable and several others)[20,21] and other factors that must be considered in a robust power analysis (i.e., an analysis of the ability of the study to demonstrate an association if one exists).[22]
Given that parametric methods (methods that depend upon assumption[s] about the distribution of the data) are often used in conjunction with regression techniques, the following assumptions for the use of parametric test statistics should also be considered when conducting regression analyses: (1) random sampling of data, (2) independence of data in the sample, (3) normal distribution of sample means, and (4) interval scale data (an example of an interval scale is the Fahrenheit temperature scale).[20] The assumption of normally distributed sample means is often justified in large samples by the central limit theorem (which states that the distribution of the mean of a randomly drawn sample approaches a normal distribution as the sample size increases), together with the law of large numbers (a theorem that states that a mean value from a randomly drawn sample converges in probability to a population mean when the sample size is large).[17] Within regression, categorical variables can be added through dummy codes, which loosen the interval scale requirement and broaden the applicability of the procedure.[23] Essentially, a dummy code creates a binary categorization by assigning either a 0 or a 1 to respective groups such as male and female or one therapeutic agent versus another; investigators should refer to other sources for a more complete discussion of this issue.[6,9,23] Dummy coding establishes the inclusion of an important qualitative component in multivariate regressions, allowing researchers to investigate processes beyond continuous variables. Furthermore, dummy coding allows the investigator to test for differences between 2 or more groups, thus employing elements of a t test or F test.
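Dummy coding can be sketched with a single binary predictor. In the example below, the group labels, costs, and the helper `fit_simple_ols` are all hypothetical and introduced only for illustration; the point is that, with a 0/1 predictor, the least squares intercept equals the mean of the reference group and the slope equals the difference between the 2 group means.

```python
def fit_simple_ols(x, y):
    """Return (a, b), the least squares intercept and slope for Y = a + bX."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    return mean_y - b * mean_x, b

# Hypothetical data: therapeutic agent received and monthly cost in dollars
agent = ["A", "A", "A", "B", "B", "B"]
cost = [100, 120, 110, 150, 170, 160]

# Dummy code: agent A is the reference group (0), agent B is coded 1
dummy = [1 if g == "B" else 0 for g in agent]

a, b = fit_simple_ols(dummy, cost)
mean_a = sum(c for c, d in zip(cost, dummy) if d == 0) / dummy.count(0)
mean_b = sum(c for c, d in zip(cost, dummy) if d == 1) / dummy.count(1)
print("intercept:", a, "slope:", b, "group means:", mean_a, mean_b)
```

This is why a test of the dummy coefficient in this simple setting parallels a 2-group comparison of means; with additional predictors in the model, the coefficient would instead describe the adjusted difference between groups.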
A linear relationship between the dependent and independent variable (item 1 in the above list of key assumptions) is often simply assumed; other functions (e.g., exponential terms) or transformations (e.g., logarithmic transformations) may be more appropriately employed to model the data and should often be explored. However, when modeling the variables with other mathematical functions, the other assumptions of OLS analysis continue to play an important role, especially those concerning the residuals (items 3-7 in the above list of key assumptions).
The assumptions and diagnostic tools of regression rest heavily upon the residual term of the model. In the linear regression model the residuals are assumed to be normally and independently distributed, with a mean of zero and a constant variance, which is typically denoted by econometricians as follows:
$$\varepsilon \sim \mathrm{NID}(0, \sigma^2)$$
where ε is the residual, ~ denotes "is distributed as," NID refers to "normally and independently distributed," 0 is the expected average value of the residual (i.e., zero), and σ² is the (constant) variance of the residual, defined as the squared value of the standard deviation; notably, the standard deviation of the residual is also another measure of a model's "goodness of fit."[16,17] According to Gauss-Markov theory, the best linear unbiased estimates (BLUE) or minimum variance linear unbiased estimates (MVLUE) for the coefficients of the linear regression model follow a similar form, with both the alpha and beta coefficients assumed to be normally distributed, N, with a mean estimate of α or β, and a constant variance:
$$a \sim N\!\left(\alpha,\ \sigma^2\left[\frac{1}{n} + \frac{\bar{X}^2}{\sum x^2}\right]\right), \qquad b \sim N\!\left(\beta,\ \frac{\sigma^2}{\sum x^2}\right)$$
where a is the parameter estimate for the Y-intercept; α is the population parameter for the Y-intercept; b is the parameter estimate for the slope; β is the population parameter for the slope; σ² is the variance of the residual term; x is the deviation of the independent variable from the sample mean, or $(X - \bar{X})$; $\bar{X}$ is the mean value for the independent variable X; and n is the number of observations.[15-17] A major goal, then, of assessing residuals is to ensure that BLUE/MVLUE is achieved concerning the parameter estimates a and b.
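The sampling variability of a and b described above is what underlies the standard errors and t statistics reported with regression output. A minimal sketch follows, assuming the standard simple-regression results (residual variance estimated as the sum of squared residuals divided by n - 2, the variance of b as that quantity divided by the sum of squared deviations of X, and the variance of a as that quantity multiplied by 1/n plus the squared mean of X over the same sum). These formulas are textbook results rather than expressions quoted from this paper, and the data and function name are hypothetical.

```python
import math

def ols_with_standard_errors(x, y):
    """Return (a, b, se_a, se_b) for the simple linear model Y = a + bX."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)  # sum of squared deviations of X
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    # Estimate of the residual variance, using n - 2 degrees of freedom
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s2 = sse / (n - 2)
    se_b = math.sqrt(s2 / sxx)
    se_a = math.sqrt(s2 * (1 / n + mean_x ** 2 / sxx))
    return a, b, se_a, se_b

# Hypothetical data: physician visits (X) and total cost in dollars (Y)
visits = [1, 2, 3, 4, 5]
cost = [150, 240, 370, 440, 550]

a, b, se_a, se_b = ols_with_standard_errors(visits, cost)
t_b = b / se_b  # t statistic: parameter estimate divided by its standard error
print(round(se_a, 4), round(se_b, 4), round(t_b, 4))
```

The t statistic would then be referred to a t distribution with n - 2 degrees of freedom to obtain a P value; that step is omitted here to keep the sketch free of external libraries.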
The assumption of no perfect multicollinearity (item 8 in the above list of key assumptions) concerns instances in which the independent variables are highly or perfectly correlated with each other, so that the individual effects of each cannot be estimated with precision.[15,17,24] Thus, intercorrelated independent variables may decrease both the precision and accuracy of coefficients that are estimated in the regression model. Multicollinearity may arise from the following:
• the method of data collection (e.g., employing a sampling technique with an inappropriately limited number of values),
• regression model constraints or constraints in the sample population,
• model misspecification, and
• over-determined modeling (wherein a regression model has more independent variables than the number of observations).[25]
However, multicollinearity substantially affects the coefficient estimates only when it is present at high levels.[15] The only substantial effect of multicollinearity is that the standard errors of the regression coefficients may be unusually large and an ideal and efficient solution may not exist.[26] Several methods are available for detection of multicollinearity, which may include observing (1) a large R² value with relatively few statistically significant t ratios, (2) high pairwise correlations between independent variables, or (3) either a high value for the condition index or the variance inflation factor (VIF) (e.g., values of the condition index ≥ 15 or VIF ≥ 10).[1,15-17] Researchers should remain aware that ignoring multicollinearity may yield large variances for parameter estimates, resulting in statistically insignificant parameter coefficients and wide confidence intervals.
Despite being able to recognize the presence of this correlation, controlling for it is challenging and may include transforming or combining variables, imposing a priori restrictions, increasing the number of observations in the study, or adding or dropping a variable.[15]
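The VIF screening rule mentioned above can be sketched for the simplest case of 2 independent variables. The sketch assumes the standard definition of the VIF as 1 / (1 - R²), where R² comes from regressing one predictor on the others; with only 2 predictors, that R² is the squared pairwise correlation. The definition is not spelled out in this paper, and the data and function names are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation for two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

def vif_two_predictors(x1, x2):
    """VIF for either predictor in a model with exactly 2 predictors."""
    r = pearson_r(x1, x2)
    return 1 / (1 - r ** 2)

# Hypothetical predictors that carry nearly the same information
prior_visits = [1, 2, 3, 4, 5]
prior_claims = [1.1, 1.9, 3.2, 3.9, 5.1]
vif_high = vif_two_predictors(prior_visits, prior_claims)

# Hypothetical predictors that are uncorrelated by construction
x1 = [-1, 1, -1, 1]
x2 = [-1, -1, 1, 1]
vif_none = vif_two_predictors(x1, x2)

print("nearly collinear:", round(vif_high, 1), "uncorrelated:", vif_none)
```

In the nearly collinear case the value lies well above the guideline of 10 cited above, whereas uncorrelated predictors give the minimum possible value of 1. Models with more than 2 predictors would need the full auxiliary regression for each predictor, which this sketch does not attempt.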

■ ■ Regression Diagnostics via Residual Analyses
Residual analyses afford researchers the ability to test explicitly for the assumptions of constant variance (homoscedasticity, or same scattering), independence, linearity, and normality.[17] These diagnostics are warranted in any investigation that employs regression procedures. Residuals are defined as the error terms from a regression equation and represent the difference between (a) the actual value of the dependent variable and (b) the value of the dependent variable that is estimated from the line of best fit (i.e., $e = Y_{\text{observed}} - Y_{\text{predicted}}$). If the assumptions of linear regression are violated or if the regression equation inappropriately models a set of data, anomalies will be observed in the error term. Outliers within the dataset can also be identified through a residual assessment. The relevance of diagnostic analyses extends beyond those that assess the goodness of fit (e.g., analysis of overall F tests and R² values in addition to graphical and formal statistical analyses of scatter plots and residuals). Based upon the findings of the residual analysis, researchers may be directed toward examining alternate methods of properly specifying the regression equation, which may require the use of advanced mathematical and transformational techniques (e.g., quadratic terms, exponentials, or logarithmic transformations).[6,27]
The values observed in the error term should be stochastic or random in nature. Obtaining a graph of the residual plots is a preliminary method for verifying that the assumptions of the regression model have been met and that the parameters are the best linear unbiased estimates. Investigators should initially plot the standardized residuals versus the predicted values of the dependent variable; nonstandardized residual plots may also be utilized and follow a similar interpretation.[7] Figure 3 presents various scatter plots of the standardized residual (error term) against the predicted value of the dependent variable.
An ideal pattern would be one that is stochastic and randomly distributed around an average value of zero (Figure 3A). When a straight-line regression equation is used to model a set of data that is nonlinear, or when the residuals are systematically related to other cases within the set (i.e., autocorrelated), a curved pattern in the residuals may be observed, as illustrated in Figure 3B. Another example of autocorrelated residuals appears in Figure 3C, as a distinct relationship appears in the error term (in this instance, a sine wave).
The presence of a funnel-shaped pattern (Figure 3D) may indicate a violation of the constant variance assumption; the illustration depicts an increased dispersion or variance of the residuals as the predicted value increases. Importantly, logarithmic transformations of variables often simultaneously stabilize the variance and normalize the data. Tests of normality may be conducted by viewing a histogram of the error terms or normal quantile-comparison plots in addition to formal statistical methods (e.g., correlation test for normality).[1,6,16,17] The presence of an outlier in the set is depicted in Figure 3E. Researchers should analyze individual cases to determine if they may be attributed to error or if they are indeed valid observations. A number of advanced techniques exist that allow the researcher to detect and correct for violations of the regression model, although a formal description of these techniques is beyond the scope of this paper. Residual analyses provide a means of testing many assumptions of a regression model, but the challenge often resides in how to control for violations once they have been identified. The appropriate techniques are often highly specific for a given violation, and researchers must first begin by understanding any underlying theory behind the nature of the variables of interest. Relatively simple transformations of one or more of the variables may rectify common violations.
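A residual screen of the kind described above can be sketched in a few lines. The sketch below makes two simplifying assumptions that are not taken from this paper: residuals are standardized by dividing by the residual standard deviation alone (ignoring leverage), and an absolute standardized residual above 2 is used as an arbitrary flag for closer review. The data are hypothetical, with one deliberately aberrant value.

```python
import math

def standardized_residuals(x, y):
    """Fit Y = a + bX by least squares; return (residuals, standardized residuals)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    # Residual standard deviation with n - 2 degrees of freedom
    s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
    return residuals, [e / s for e in residuals]

# Hypothetical data: the fifth observation departs sharply from the trend
x = list(range(1, 11))
y = [2, 4, 6, 8, 25, 12, 14, 16, 18, 20]

residuals, z = standardized_residuals(x, y)
# Positions (0-based) whose standardized residual exceeds the assumed cutoff of 2
flagged = [i for i, zi in enumerate(z) if abs(zi) > 2]
print("flagged positions:", flagged)
```

A flagged case is only a prompt for review: as noted above, the analyst still has to determine whether the observation reflects an error or is a valid value. In practice the standardized residuals would also be plotted against the predicted values to look for the patterns shown in Figure 3.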
Serial correlation or autocorrelation is a particular violation of the independence assumption (e.g., if the residuals are related between observations).[17,28] The most common formal statistical test for autocorrelated residual terms is the Durbin-Watson d test, whereas correcting for autocorrelation often includes employing generalized least squares (GLS) methods.[1,6,16,17] Other common methods used to quantitatively test for the presence of serial correlation include the runs test (Geary test) and the Breusch-Godfrey test.[29-31] The method of rectifying autocorrelation is dependent upon the specific interrelationship and structure of the serial correlation itself. Two steps are typically required: first, obtain or estimate a coefficient of autocovariance, and, second, use these values to transform the original regression equation.
In calculating the coefficient of autocovariance, time-related processes may follow autoregressive functions that involve distributed lag functions (wherein the dependent variable is systematically related to past values of the dependent variable), moving average schemes (that involve the dependent term's being functionally related to past values of the error term), combinations (autoregressive moving average), and numerous others.[32] Transformations to regression equations that incorporate estimations of the coefficient of autocovariance typically follow applications of GLS, such as feasible generalized least squares.
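The Durbin-Watson d statistic named above can be computed directly from an ordered series of residuals. The sketch assumes the standard definition (the sum of squared differences between successive residuals divided by the sum of squared residuals), which is not written out in this paper; the two residual series are hypothetical and deliberately extreme.

```python
def durbin_watson(residuals):
    """Durbin-Watson d statistic for residuals listed in time (or case) order."""
    # Numerator: squared differences between each residual and the one before it
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Hypothetical residuals that flip sign at every step (negative autocorrelation)
alternating = [1, -1, 1, -1, 1, -1]
# Hypothetical residuals that stay on one side in long runs (positive autocorrelation)
runs = [1, 1, 1, -1, -1, -1]

d_alternating = durbin_watson(alternating)
d_runs = durbin_watson(runs)
print(round(d_alternating, 3), round(d_runs, 3))
```

The statistic ranges from 0 to 4: values near 2 are consistent with no first-order autocorrelation, values toward 0 suggest positive autocorrelation, and values toward 4 suggest negative autocorrelation. A formal conclusion requires comparison with tabulated critical bounds, which this sketch does not include.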
When the constant variance assumption is violated (i.e., the variance of the residual is not equal between cases), heteroscedasticity (unequal variance, or different scattering) is present. Detection of heteroscedasticity may be undertaken through quantitative procedures such as the following: Park test, Glejser test, Breusch-Pagan/Godfrey test, Goldfeld-Quandt test, or White's general test. 33-37 The large number of statistical tests from which to choose is due to the diverse nature of heteroscedasticity. In a method that is somewhat similar to controlling for autocorrelation, GLS is often used to correct for unequal variance via weighted least squares estimators. Heteroscedasticity-robust statistical tests have also been developed and may be considered by analysts (e.g., Eicker-Huber-White standard errors). 38-40 Additionally, procedures have been reported that control for concomitant violations of both serial correlation and unequal variance (e.g., autoregressive conditional heteroscedasticity model). 41
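To make one of the detection procedures listed above concrete, the following sketch implements the studentized (Koenker) variant of the Breusch-Pagan test for a model with a single predictor: the squared OLS residuals are regressed on the predictor, and n times the R-squared of that auxiliary regression is compared with a chi-square distribution with 1 degree of freedom. The simulated funnel-shaped data, the seed, and all names are hypothetical; with several predictors the degrees of freedom and the auxiliary regression would change accordingly.

```python
import math
import random


def ols(x, y):
    """Least-squares fit of y on a single predictor x; returns (intercept, slope)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared(x, y):
    """Coefficient of determination for the simple regression of y on x."""
    intercept, slope = ols(x, y)
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def breusch_pagan_lm(x, y):
    """Koenker's n*R^2 form of the Breusch-Pagan test; returns (LM, p-value)."""
    intercept, slope = ols(x, y)
    squared_resid = [(yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y)]
    lm = max(len(x) * r_squared(x, squared_resid), 0.0)
    # Upper tail of a chi-square with 1 df: P(X > lm) = erfc(sqrt(lm / 2)).
    return lm, math.erfc(math.sqrt(lm / 2.0))


# Simulated, hypothetical data whose error spread grows with x (a funnel shape).
rng = random.Random(0)
x = [rng.uniform(1, 10) for _ in range(300)]
y = [1.0 + 2.0 * xi + rng.gauss(0, 0.5 * xi) for xi in x]

lm_stat, p_value = breusch_pagan_lm(x, y)
print(f"LM = {lm_stat:.2f}, p = {p_value:.2g}")  # a small p suggests unequal variance
```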

■ ■ Logarithmic Transformation of Data
Not all phenomena conform to straight-line functions. Health care utilization or cost data are rarely normally distributed and are often characterized by heteroscedasticity (different scattering) and distributions that are positively skewed, thus warranting transformation to satisfy the assumption of normality. 42-44 Departures from linearity may be investigated via regression, provided that the proper mathematical operands are expressed within the model. Various transformations of variables may also be employed to create linear functions in instances that involve curvilinear or nonlinear processes, skewed distributions, or heteroscedastic residuals; transformations may additionally resolve violations of the OLS estimation of a regression model and can essentially change a nonlinear form to a linear one. To illustrate, if an observed relationship exists between the variables as depicted in Figure 4A, a logarithmic transformation (often with a Napierian or natural logarithm, ln) may establish a linear function between variables as indicated in Figure 4B. A representation of the residuals for each of the models appears in Figure 5. A curvilinear relationship in the residual plot of the untransformed data (Figure 5A) is due to model misspecification (i.e., the incorrect assignment of a linear regression model to nonlinear data).
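The effect described above can be reproduced with a small sketch. The data below are hypothetical and noise-free, generated from an exponential curve, so they stand in for the kind of relationship the figures depict rather than reproducing them. A straight line fitted to the raw values leaves a curved residual pattern (positive at both ends, negative in the middle), whereas the same fit to the natural logarithm of the outcome recovers a linear function.

```python
import math


def ols(x, y):
    """Least-squares fit of y on a single predictor x; returns (intercept, slope)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical curvilinear data: y = 2 * exp(0.3 * x), with no noise.
x = list(range(0, 11))
y = [2.0 * math.exp(0.3 * xi) for xi in x]

# Misspecified model: a straight line through the untransformed outcome.
a_raw, b_raw = ols(x, y)
resid_raw = [yi - (a_raw + b_raw * xi) for xi, yi in zip(x, y)]

# Transformed model: a straight line through ln(y).
ln_y = [math.log(yi) for yi in y]
a_log, b_log = ols(x, ln_y)
resid_log = [li - (a_log + b_log * xi) for xi, li in zip(x, ln_y)]

print([round(r, 2) for r in resid_raw])        # curved: +, ..., -, ..., +
print(round(a_log, 4), round(b_log, 4))        # ln(2) and 0.3 are recovered
print(max(abs(r) for r in resid_log) < 1e-9)   # residuals vanish after the transform
```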
When the dependent variable of the model is transformed via a natural logarithm or Briggsian/common logarithm (log, base 10), the resulting model is commonly denoted as a "log-linear" or "log-lin" model. In economics, the coefficient of such a model is loosely defined as a semielasticity. 45,46 When both the dependent and independent variables have been log-transformed, the resulting coefficients are interpreted as a constant elasticity or, simply, an elasticity. If transformations are required to meet the assumptions of a linear model, the researcher must interpret the coefficient estimates properly. Whenever transformations are implemented, the data units change, so that coefficient estimates from logarithmic models cannot be interpreted in an identical fashion to coefficients derived from simple linear models. Table 1 presents a summary of the interpretations of various coefficients given log-transformed variables. Importantly, when logarithmic transformations are used, researchers cannot estimate actual values of a variable by simply applying an antilog. This method introduces a substantial retransformation bias, which may often be overcome by applying a smearing estimation, as proposed by Duan (1983). 47-49 Duan's smearing estimate has received increased attention in pharmacoeconomics and outcomes research, as health care decision makers often desire regression parameter coefficients to be presented in units rather than as percentage comparators. However, an important caveat in the use of Duan's smearing estimator is that it has been found to introduce a substantial bias when heteroscedasticity is present. Furthermore, robust estimators (e.g., Eicker-Huber-White standard errors) often do not correct the bias.
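A minimal sketch of the smearing idea follows, assuming a log-lin model with a single predictor and homoscedastic errors (the setting in which the caveat above does not apply). The naive retransformation simply exponentiates the fitted log values; the smearing estimate multiplies that result by the average of the exponentiated residuals. The simulated costs, the seed, and all variable names are hypothetical.

```python
import math
import random


def ols(x, y):
    """Least-squares fit of y on a single predictor x; returns (intercept, slope)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Simulated, hypothetical costs: ln(cost) = 5 + 0.2 * x + error, equal variance.
rng = random.Random(7)
n = 2000
x = [rng.uniform(0, 10) for _ in range(n)]
cost = [math.exp(5.0 + 0.2 * xi + rng.gauss(0, 1)) for xi in x]

# Fit the log-lin model by OLS on the log scale.
ln_cost = [math.log(c) for c in cost]
a, b = ols(x, ln_cost)
resid = [lc - (a + b * xi) for xi, lc in zip(x, ln_cost)]

# Smearing factor: the mean of the exponentiated log-scale residuals.
smear = sum(math.exp(e) for e in resid) / n

naive = [math.exp(a + b * xi) for xi in x]   # antilog only (biased downward here)
smeared = [p * smear for p in naive]         # antilog times the smearing factor

mean_cost = sum(cost) / n
mean_naive = sum(naive) / n
mean_smeared = sum(smeared) / n
print(f"smearing factor = {smear:.3f}")
print(f"observed mean = {mean_cost:.1f}, naive = {mean_naive:.1f}, smeared = {mean_smeared:.1f}")
```

Because OLS residuals from a model with an intercept average to zero, the smearing factor is never below 1, which is why the naive antilog understates the mean on the original scale.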
In instances where unequal variance is noted, alternatives to Duan's approach may include natural log transformations with an OLS estimation, GLS and related extensions (e.g., gamma or Weibull regression with log link), or the Cox proportional hazards model. 50-52 Unfortunately, empirical analyses also indicate that no single best model is robust under all specifications. 50-52
Example. In reference to Table 1, if a β coefficient estimate obtained from a set of data was +0.250 and statistically significant, a lin-lin interpretation would be that a 1-unit increase in the independent variable (X) is associated with a 0.250-unit increase in the dependent variable (Y). For a lin-log equation wherein only the independent variable is log-transformed, the interpretation is that a 1% increase in the independent variable (X) is associated with a 0.0025-unit increase in the dependent variable (Y) (i.e., 0.250 ÷ 100). The log-lin interpretation of a log-transformed dependent variable is such that a 1-unit increase in the independent variable (X) is associated with a 25% increase in the dependent variable (Y) (i.e., 0.250 × 100), and is considered a semielasticity. Finally, for a log-log model wherein both dependent and independent variables are log-transformed, the interpretation is that a 1% increase in the independent variable (X) is associated with a 0.25% increase in the dependent variable (Y), which is interpreted as a constant elasticity.
A concern when employing transformations to variables occurs if there are any values in the dataset that are not subject to the mathematical function. 27,33 To illustrate, the logarithm of zero or any negative value is mathematically undefined. Thus, researchers attempting to apply a logarithmic transformation to a variable that has negative values or is zero would essentially exclude these cases, which may appear as a clustered region in the residual plots.
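The arithmetic behind the four interpretations in the worked example (β = 0.250) can be checked with the short sketch below. The "approximate" values are the rules of thumb used in the example; the "exact" values are the standard refinements obtained by carrying the logarithm or exponential through, and are included here only as a supplementary check rather than as part of Table 1. The variable names are hypothetical.

```python
import math

beta = 0.250  # coefficient estimate from the worked example

# lin-lin: change in Y (units) for a 1-unit increase in X.
lin_lin = beta

# lin-log: change in Y (units) for a 1% increase in X.
lin_log_approx = beta / 100
lin_log_exact = beta * math.log(1.01)

# log-lin: percentage change in Y for a 1-unit increase in X (semielasticity).
log_lin_approx = beta * 100
log_lin_exact = (math.exp(beta) - 1) * 100

# log-log: percentage change in Y for a 1% increase in X (elasticity).
log_log_approx = beta
log_log_exact = (1.01 ** beta - 1) * 100

print(f"lin-lin: {lin_lin:.3f} units")
print(f"lin-log: {lin_log_approx:.4f} units (exact {lin_log_exact:.5f})")
print(f"log-lin: {log_lin_approx:.1f}% (exact {log_lin_exact:.1f}%)")
print(f"log-log: {log_log_approx:.2f}% (exact {log_log_exact:.4f}%)")
```

The approximation is close for the lin-log and log-log cases but noticeably understates the log-lin effect when the coefficient is as large as 0.250 (about 28.4% exact versus 25% approximate), so the rule of thumb is best reserved for small coefficients.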
The exclusion of zero values is of particular concern for health care utilization or expenditure data, as many patients or medical beneficiaries go through sustained periods without utilization or expenditures. Addressing this issue often involves the use of advanced statistical procedures. 43,48,49 Methods used by researchers have included employing a log plus constant model (e.g., $1 added to all costs to allow a logarithmic transformation), a 2-part model (wherein the first model uses a logistic regression to predict the probability of costs or utilization and a second model estimates the actual level of use for subjects that incur costs or utilization), survival analysis techniques, or generalized linear models with related variants. 43,49 Consideration of gamma, negative-binomial, or Poisson distributions of cost or utilization data may be incorporated to yield best linear unbiased estimates. Although all of these methods have appeared in the literature, no single best approach may be recommended, nor do the methods represent an exhaustive list. Applying any given method without assessing the fundamental assumptions of the statistical test may produce misleading results and conclusions.
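The logic of the 2-part model can be illustrated with a deliberately simplified sketch. In practice, part 1 would be a logistic regression with covariates and part 2 a regression (often on log cost, or a generalized linear model) fitted only to subjects with positive costs. Here the only predictor is assumed to be a treatment group indicator, in which case the fitted values of both parts reduce to simple group proportions and group means. The groups, costs, and function name are hypothetical.

```python
# Hypothetical annual costs (USD) by treatment group; zeros are non-users.
costs = {
    "A": [0, 0, 120, 300, 0, 80],
    "B": [0, 50, 0, 0, 0, 400],
}


def two_part_expected_cost(values):
    """Return (P(cost > 0), mean cost given any cost, expected cost)."""
    positive = [v for v in values if v > 0]
    # Part 1: probability of incurring any cost.
    p_any = len(positive) / len(values)
    # Part 2: level of cost among those who incur costs.
    mean_if_any = sum(positive) / len(positive) if positive else 0.0
    # Combined prediction: probability times conditional level.
    return p_any, mean_if_any, p_any * mean_if_any


for group, values in costs.items():
    p_any, mean_if_any, expected = two_part_expected_cost(values)
    print(f"{group}: P(any) = {p_any:.3f}, mean if any = {mean_if_any:.2f}, "
          f"expected = {expected:.2f}")
```

Separating the two parts allows zero-cost subjects to remain in the analysis while the second part is free to use a log transformation, since it is fitted only to strictly positive values.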

[Figure caption: Logarithmic Transformation of Data to Yield a Linear Function]
Regarding broader elements of forecasting, censored data (wherein data truncation occurs due to death or lack of follow-up) are a cause of concern. 53-55 Empiric research has found that the

■ ■ Applications in Managed Care
In a purely hypothetical scenario with relevance to managed care or formulary decision making, an analyst may want to use a managed care dataset to ascertain whether health care cost differences exist between therapeutic options for patients with heart failure. Thus, the dependent variable of interest would be total health care costs. Furthermore, the analyst may also be interested in determining the related predictors of hospitalization, defining whether hospitalization occurred as a second dependent variable that is dichotomous (i.e., hospitalization = 1, no hospitalization = 0). Given that retrospective analysis of administrative claims data lacks randomization and fully experimental methodologies, relevant confounding variables must be included within the regression model to statistically control for differences between study groups. 11 Thus, in this example, simply comparing unadjusted total health care costs (or whether a patient was or was not hospitalized) between the therapeutic options without controlling for confounders is inappropriate; regression analyses are required. A thorough literature review of predictors of cost and/or hospitalization in heart failure patients should provide the theoretical basis for the variables that are ultimately included in the regression model in this hypothetical case. In pharmacoeconomic or outcomes research, these independent variables may include risk adjustment measures to control for case mix severity (e.g., Chronic Disease Score [CDS] based upon prescription drug use, Charlson Index based upon International Classification of Diseases, 9th Revision, Clinical Modification codes), patient characteristics (e.g., age, sex), comorbid conditions, pretreatment costs, and treatment groups. 58-60 Additionally, medication adherence (e.g., Medication Possession Ratio) and insurance or provider characteristics (e.g., type of health care organization, Medicare, Medicaid, prescriber specialty) may be deemed important to consider within a research question. 6
$$\text{Expenditures} = \alpha + \beta_1\,\text{CDS} + \beta_2\,\text{Age} + \beta_3\,\text{Sex} + \beta_4\,\text{Comorbidity} + \beta_5\,\text{Treatment One} + \beta_6\,\text{Treatment Two} + \varepsilon$$
where Sex, Comorbidity, Treatment One, and Treatment Two are defined as dichotomous dummy variables (variables with only 2 values: 0 or 1). Treatment Three is considered the baseline used for comparison. Given that the total number of dummy variables required is equal to 1 less than the total number of categories to be compared, a single dummy variable is used for Sex, while 2 dummy variables are used for the treatment groups. Comorbidities may also be coded as dummy variables, with the presence of the condition coded as 1 and absence coded as 0. Examples relevant to heart failure may include a past history of myocardial infarction or stroke and can extend to include the presence of diabetes, atrial fibrillation, renal disease, or hypertension.
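The dummy-variable coding described above can be sketched as follows. The function builds one row of the design matrix for the expenditure model, with Treatment Three as the omitted baseline category. The coefficient values, the patient profile, and the choice to code female as 1 are hypothetical assumptions made purely for illustration; they are not estimates from any dataset.

```python
TREATMENTS = ("one", "two", "three")  # "three" is the baseline category


def design_row(cds, age, female, comorbidity, treatment):
    """Return [1, CDS, Age, Sex, Comorbidity, Treatment One, Treatment Two]."""
    if treatment not in TREATMENTS:
        raise ValueError(f"unknown treatment: {treatment!r}")
    return [
        1.0,                                 # intercept term (alpha)
        float(cds),
        float(age),
        1.0 if female else 0.0,              # assumed coding: female = 1
        1.0 if comorbidity else 0.0,         # condition present = 1
        1.0 if treatment == "one" else 0.0,  # both zero means Treatment Three
        1.0 if treatment == "two" else 0.0,
    ]


# Hypothetical coefficients: alpha, beta_1, ..., beta_6 (illustration only).
coef = [1000.0, 250.0, 20.0, -100.0, 800.0, -300.0, 150.0]


def predict(row):
    """Predicted expenditures for one design-matrix row."""
    return sum(b * v for b, v in zip(coef, row))


patient = dict(cds=4, age=65, female=True, comorbidity=True)
row_one = design_row(treatment="one", **patient)
row_three = design_row(treatment="three", **patient)

print(row_three)                               # treatment dummies are both 0.0
print(predict(row_one) - predict(row_three))   # -300.0, i.e., beta_5
```

Holding the other covariates fixed, the difference in predicted expenditures between Treatment One and the baseline equals the Treatment One coefficient, which is the sense in which each dummy coefficient is a comparison against Treatment Three.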
Concerning the second dependent variable in the example (i.e., hospitalization), when binary categorical variables are used as a dependent variable, a logistic regression may be considered; this may also be extended to ascertain predictors of treatment success (i.e., treatment success = 1, treatment failure = 0). 62
Example 1. Armstrong and Malone (2002) conducted an assessment of asthma-related costs associated with fluticasone versus leukotriene modifier use, in which the dependent variable was log-transformed post-asthma cost and independent variables included age, sex, log-transformed pre-asthma cost, CDS, presence of chronic obstructive pulmonary disease, treatment group (i.e., fluticasone or leukotriene modifier), and the use of certain medications prior to the study period (i.e., number of short-acting β-agonist canisters used and dummy codes representing the use of long-acting β-agonists, theophylline or mast cell stabilizers, or oral corticosteroids). 63
Example 2. McLaughlin, Eaddy, and Grudzinski (2004) analyzed depression-related charges associated with treatment with sertraline and citalopram. 64 The dependent variable, treatment charges, was transformed via natural logarithm specifically due to the detection of heteroscedasticity. The independent variables included age, sex, geographic region, natural log of treatment charges prior to the study period, the presence of comorbidities (either mental or nonmental health), the type of managed care institution, the physician specialty, the use of emergency department or hospital services prior to the study period, and the year of initial diagnosis.

■ ■ Conclusion
Regression is a powerful and commonly used statistical technique that gives researchers the ability to quantify mathematical relationships for purposes of description, hypothesis testing, or prediction. Before a proper analysis can begin, an understanding is necessary of correlation, of the assumptions of a classical linear regression model, and of the importance of residual assessments. Despite its complexities, the flexibility of regression makes it especially applicable in certain settings, enabling decision makers to analyze complex phenomena and answer questions that other statistical methods inadequately address. Contingent upon the specific research question, a number of extensions to basic regression techniques may be explored so that this method can be used appropriately in pharmacoeconomic or outcomes research and assist in formulary decision making. The increased availability of administrative databases containing medical and pharmacy claims data may provide those in managed care settings with a greater ability to evaluate treatments and practice patterns. Given that administrative data are observational rather than experimental, it is critical that analysts and decision makers be versed in appropriate statistical methods to design investigations or evaluate empiric findings.

DISCLOSURES
No outside funding supported this study. The author discloses no potential bias or conflict of interest relating to this article.